Common Crawl
https://commoncrawl.org/
Used to train #LLM models (e.g. #ChatGPT)
The data can be downloaded from https://data.commoncrawl.org/
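A minimal sketch of how file URLs under https://data.commoncrawl.org/ might be built. It assumes each crawl publishes a gzipped listing at `crawl-data/<crawl-id>/warc.paths.gz` with one relative path per line. The crawl ID `CC-MAIN-2024-10` and the helper names `paths_listing_url` and `expand_paths` are illustrative, not part of any official client.

```python
import gzip
from urllib.parse import urljoin

# Base URL taken from the note above.
BASE_URL = "https://data.commoncrawl.org/"


def paths_listing_url(crawl_id, kind="warc"):
    """Return the URL of the gzipped path listing for one crawl.

    Assumption: listings live at crawl-data/<crawl-id>/<kind>.paths.gz.
    """
    return urljoin(BASE_URL, f"crawl-data/{crawl_id}/{kind}.paths.gz")


def expand_paths(gz_bytes):
    """Decompress a *.paths.gz payload and turn each relative path into a full URL."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [urljoin(BASE_URL, line.strip())
            for line in text.splitlines()
            if line.strip()]


if __name__ == "__main__":
    # Hypothetical crawl ID; check the site for the crawls that actually exist.
    print(paths_listing_url("CC-MAIN-2024-10"))
```

To use it, download the listing with any HTTP client (for example `urllib.request.urlopen(paths_listing_url(...)).read()`) and pass the bytes to `expand_paths`. Individual files are large, so it is probably best to fetch only a few at first.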
https://ja.wikipedia.org/wiki/コモン%E3%83%BBクロール
https://commoncrawl.github.io/cc-crawl-statistics/
Languages
https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
The language of a document is identified by Compact Language Detector 2 (CLD2).
👉
cld2